Finalizing iterative methods for linear systems (minres, bicg, bicgstab)
Jacobi, Gauss-Seidel, SSOR methods as preconditioners
Incomplete LU for preconditioning, three flavours: ILU(k), ILUT, ILU2
Recall that to find eigenvalues of matrix of size $N\times N$ one can use, e.g. the QR algorithm.
However, in some applications matrix is so large, that we even can not store it exactly.
Typically such matrices are given as a black-box that is able only to multiply matrix by vector (sometimes even without access to matrix elements). This is what we assume today.
In this case the best we can do is to solve partial eigenvalue problem, e.g.
For simplicity we will consider the case when matrix is normal and thus has orthonormal basis of eigenvectors.
To find the smallest eigenvalue one may run power method for $A^{-1}$:
$$x_{i+1} = \frac{A^{-1}x_{i}}{\|A^{-1}x_{i}\|}.$$
To accelerate convergence shift-and-invert strategy can be used:
$$x_{i+1} = \frac{(A-\sigma I)^{-1}x_{i}}{\|(A-\sigma I)^{-1}x_{i}\|},$$
where $\sigma$ should be close to the eigenvalue we want to find.
In order to get superlinear convergence one may use adaptive shifts:
$$x_{i+1} = \frac{(A-R(x_i) I)^{-1}x_{i}}{\|(A-R(x_i) I)^{-1}x_{i}\|},$$
where $R(x_k) = \frac{(x_i, Ax_i)}{(x_i, x_i)}$ is Rayleigh quotient.
The method converges cubically for Hermitian matrices and quadratically for non-Hermitian case.
Matrices $(A- \sigma I)$ as well as $(A-R(x_i) I)$ are ill-conditioned if $\sigma$ or $R(x_i)$ are close to eigenvalues.
Thus, if you are not given e.g. LU factorization of such matrix you might face a problem.
In practice you can solve systems only with some accuracy. Recall also that condition number is an upper bound and is overestimated for cosistent rhs. So, even in RQ iteration letting the shift tend to the eigenvalue does not harm significantly the performance of the iterative methods.
If accuracy of solution of systems increases from iteration to iteration, superlinear convergence for RQ iteration can still be present, see Theorem 2.1. Otherwise, you will get linear convergence.
The block power method (also known as subspace iteration method or simultaneous vector iteration) is a natural generalization of the power method for several largest eigenvalues.
It looks as:
QR-decomposition plays role of normalization in the standard power method.
Moreover, orthogonalization prevents the columns of the $X_i$ from converging all to the eigenvector corresponding to the largest modulus eigenvalue.
import numpy as np
import matplotlib.pyplot as plt
import copy
%matplotlib inline
n = 100
k = 10
A = np.diag(1./(1. + np.arange(n))) # diagonal matrix with well-separated maximum eigenvalues
A_clustered = np.diag(1 - 1./(1. + np.arange(n))) # diagonal matrix with clustered maximum eigenvalues
def subspace_iter(A, Y0, num_iter=100):
Y0, _ = np.linalg.qr(Y0)
Y = Y0.copy()
Y_old = Y0.copy()
err = []
for i in range(num_iter):
X = A.dot(Y)
Y, _ = np.linalg.qr(X)
err.append(np.linalg.norm(Y_old - Y.dot(Y.T.dot(Y_old))))
Y_old = Y.copy()
return Y, err
Y0 = np.random.random((n, k))
Y, err = subspace_iter(A, Y0, num_iter=100)
Y, err_clustered = subspace_iter(A_clustered, Y0, num_iter=100) #np.diag((diagonal - sigma)**(-1))
plt.semilogy(err, label='Separated eigvals')
plt.semilogy(err_clustered, label='Clustered eigvals')
plt.xlabel('Number of iterations')
plt.ylabel('Error')
plt.legend(loc='best')
plt.show()
Before we go to advanced methods let us discuss the important concept of Ritz approximation.
Given subspace spanned by columns of unitary matrix $Q_k$ of size $N\times k$ we consider the projected matrix $Q_k^* A Q_k$.
Let $\Theta_k=\mathrm{diag}(\theta_1,\dots,\theta_k)$ and $S_k=\begin{bmatrix}s_1 & \dots & s_k \end{bmatrix}$ be matrices of eigenvalues and eigenvectors of $(Q_k^* A Q_k)$:
$$(Q_k^* A Q_k)S_k = S_k \Theta_k$$
then $\{\theta_i\}$ are called Ritz values and $y_i = Q_k s_i$ - Ritz vectors.
Note that they are not eigenpairs of the initial matrix $AY_k\not= Y_k \Theta_k$, but the following equality holds:
$$Q_k^* (AY_k - Y_k \Theta_k) = Q_k^* (AQ_k S_k - Q_k S_k \Theta_k) = 0,$$
so the residual for the Ritz approximation is orthogonal to the subspace spanned by columns of $Q_k$.
$\lambda_\min(A) \leq \theta_\min \leq \theta_\max \leq \lambda_\max(A)$. Indeed, using Rayleigh quotient:
$$\theta_\min = \lambda_\min (Q_k^* A Q_k) = \min_{x\not=0} \frac{x^* (Q_k^* A Q_k) x}{x^* x} = \min_{y\not=0:y=Q_k x} \frac{y^* A y}{y^* y}\geq \min_{y\not= 0} \frac{y^* A y}{y^* y} = \lambda_\min(A).$$
Obviously, $\lambda_\min (Q_k^* A Q_k) = \lambda_\min(A)$ if $k=N$, but we want to construct a basis $k\ll N$ such that $\lambda_\min (Q_k^* A Q_k) \approx \lambda_\min(A)$.
Similarly, $\theta_\max \leq \lambda_\max(A)$.
Thus, if a subspace $V$ approximates first $k$ eigenvectors, then one can use the Rayleigh-Ritz method:
The question is how to find a good subspace $V$.
The good choice for $V$ is the Krylov subspace.
Recall that in the power method we used only one Krylov vector
$$x_k = \frac{A^k x_0}{\|A^k x_0\|}.$$
In this case $\theta_k = \frac{x_k^* A x_k}{x_k^* x_k}$ is nothing but a Ritz value. Natural idea is to use a bigger Krylov subspace.
As a result we can find more eigenvalues (power method only gives $\lambda_\max$). Moreover,convergence of the eigenvalue corresponding to $\lambda_\max$ will be faster.
For Hermitian matrices from the Arnoldi relation we have
$$ Q_k^*AQ_k = T_k, $$
where $Q_k$ is orthogonal basis in the Krylov subspace generated by the Lanczos procedure and $T_k$ is triangular matrix.
According to the Rayleigh-Ritz method we expect that eigenvalues of $T_k$ approximate eigenvalues of $A$. This method is called the Lanczos method. For nonsymmetric matrices it is called the Arnoldi method and instead of tridiagonal $T_k$ we would get upper=Hessenberg matrix.
Let us show that $\theta_\max \approx\lambda_\max$.
Let us denote $\theta_1 \equiv \theta_\max$ and $\lambda_1 \equiv \lambda_\max$. Then
$$ \theta_1 = \max_{y\in \mathcal{K}_i, y\not=0}\frac{(y,Ay)}{(y,y)} = \max_{p_{i-1}} \frac{(p_{i-1}(A)x_0, A p_{i-1}(A)x_0)}{(p_{i-1}(A)x_0, p_{i-1}(A)x_0)}, $$
where $p_{i-1}$ is a polynomial of degree not greater than $i-1$ such that $p_{i-1}(A)x_0\not=0$.
Expand $x_0 = \sum_{j=1}^N c_j v_j$, where $v_j$ are eigenvectors of $A$ (form orthonormal basis).
Since $\theta_1 \leq \lambda_1$ we get $$ \lambda_1 - \theta_1 \leq \lambda_1 - \frac{(p_{i-1}(A)x_0, A p_{i-1}(A)x_0)}{(p_{i-1}(A)x_0, p_{i-1}(A)x_0)} $$ for any polynomial $p_{i-1}$. Hence $$ \lambda_1 - \theta_1 \leq \lambda_1 - \frac{\sum_{k=1}^N \lambda_k |p_{i-1}(\lambda_k)|^2 |c_k|^2}{\sum_{k=1}^N |p_{i-1}(\lambda_k)|^2 |c_k|^2} = $$ $$ = \frac{\sum_{k=2}^N (\lambda_1 - \lambda_k) |p_{i-1}(\lambda_k)|^2 |c_k|^2}{|p_{i-1}(\lambda_1)|^2 |c_1|^2 + \sum_{k=2}^N |p_{i-1}(\lambda_k)|^2 |c_k|^2} \leq (\lambda_1 - \lambda_n) \frac{\max_{2\leq k \leq N}|p_{i-1}(\lambda_k)|^2}{|p_{i-1}(\lambda_1)|^2 }\gamma, \quad \gamma = \frac{\sum_{k=2}^N|c_k|^2}{|c_1|^2} $$
Since the inequality holds for any polynomial $p_{i-1}$ we will choose a polynomial:
$$|p_{i-1}(\lambda_1)| \gg \max_{2\leq k \leq N}|p_{i-1}(\lambda_k)|.$$
This holds, e.g. for the Chebyshev polynomial on $[\lambda_n,\lambda_2]$. Thus, $\theta_1 \approx \lambda_1$ or more precisely (Paige-Kaniel error bound, check it!): $$ \lambda_1 - \theta_1 \leq \frac{\lambda_1 - \lambda_n}{T_{i-1}^2(1 + 2\mu)}\gamma, \quad \mu = \frac{\lambda_1 - \lambda_2}{\lambda_2 - \lambda_n}, $$ where $T_{i-1}$ is a Chebyshev polynomial.
import scipy as sp
import scipy.sparse
from scipy.sparse import csc_matrix, csr_matrix
import matplotlib.pyplot as plt
import scipy.linalg
import scipy.sparse.linalg
import copy
n = 40
ex = np.ones(n)
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr')
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
def lanczos(A, m):
n = A.shape[0]
v = np.random.random((n, 1))
v = v / np.linalg.norm(v)
v_old = np.zeros((n, 1))
beta = np.zeros(m)
alpha = np.zeros(m)
for j in range(m-1):
w = A.dot(v)
alpha[j] = w.T.dot(v)
w = w - alpha[j] * v - beta[j] * v_old
beta[j+1] = np.linalg.norm(w)
v_old = v.copy()
v = w / beta[j+1]
w = A.dot(v)
alpha[m-1] = w.T.dot(v)
A = np.diag(beta[1:], k=-1) + np.diag(beta[1:], k=1) + np.diag(alpha[:], k=0)
l, _ = np.linalg.eigh(A)
return l
# Approximation of the largest eigenvalue for different k
l_large_exact = sp.sparse.linalg.eigsh(A, k=99, which='LM')[0][0]
print('k=10, err = {}'.format(np.abs(l_large_exact - lanczos(A, 10)[0])))
print('k=20, err = {}'.format(np.abs(l_large_exact - lanczos(A, 20)[0])))
print('k=100, err = {}'.format(np.abs(l_large_exact - lanczos(A, 100)[0])))
The Lanczos vectors may loose orthogonality during the process due to floating-point errors, thus all practical implementations of it use restarts.
A very good introduction to the topic is given in the book of Golub and Van-Loan (Matrix Computations).
An alternative to this approach are the so-called preconditioned iterative methods that include:
Consider Rayleigh quotient $R(x) = \frac{(x,Ax)}{(x,x)}$. Then $$ \nabla R(x) = \frac{2}{(x,x)} (Ax - R(x) x), $$
so the simplest gradient descent method with a preconditioner $B$ reads
$$ x_{i+1} = x_{i} - \tau_i B^{-1} (Ax_i - R(x_i) x_i), $$
$$ x_{i+1} = \frac{x_{i+1}}{\|x_{i+1}\|}. $$
Typically $B\approx (A-\sigma I)$, where $\sigma$ is called shift.
The closer $\sigma$ to the required eigenvalue is, the faster the convergence.
Parameter $\tau_k$ is chosen to minimize the $R(x_{i+1})$ over $\tau_k$ (steepest descent method).
One can think of this minimization procedure as minimization in basis $V = [x_i, r_i]$, where $r_{i}=B^{-1} (Ax_i - R(x_i) x_i)$.
This results into the generalized eigenvalue problem $(V^*AV)\begin{bmatrix}1 \\ -\tau_i \end{bmatrix} = \theta (V^*V) \begin{bmatrix}1 \\ -\tau_i \end{bmatrix}$ (Rayleigh-Ritz procedure with no orthogonalization of $V$). Here $\theta$ is the closest to the required eigenvalue.
Theorem (Knyazev and Neymeyr)
Let
then
$$ \left|\frac{R(x_{i+1}) - \lambda_j}{R(x_{i+1}) - \lambda_{j+1}}\right| < \left[ 1 - (1-\gamma)\left(1 - \frac{\lambda_j}{\lambda_{j+1}}\right) \right]^2 \cdot \left|\frac{R(x_{i}) - \lambda_j}{R(x_{i}) - \lambda_{j+1}}\right| $$
To find, e.g. $k$ eigenvalues one can do a one step of PINVIT for each vector:
$$ x^{(j)}_{i+1} = x^{(j)}_{i} - \tau^{(j)}_i B^{-1} (Ax^{(j)}_i - R(x^{(j)}_i) x^{(j)}_i), \quad j=1,\dots,k $$
$$ x^{(j)}_{i+1} = \frac{x^{(j)}_{i+1}}{\|x^{(j)}_{i+1}\|}. $$
And then orthogonalize them using the QR-decomposition. However, it is better to use the Rayleigh-Ritz procedure:
LOPCG method
$$ x_{i+1} = x_{i} - \alpha_i B^{-1} (Ax_i - R(x_i) x_i) + \beta_i x_{i-1} , $$
$$ x_{i+1} = \frac{x_{i+1}}{\|x_{i+1}\|}. $$
is a superior to PINVIT method as it adds to basis not only $x_i$ and $r_i$, but also $x_{i-1}$.
However, this interpretation leads to an unstable algorithm as $x_{i}$ is becoming colinear to $x_{i-1}$ as the procedure converges.
Knyazev suggested an equivalent stable version, which introduces new vectors $p_i$ (conjugate gradients)
$$ p_{i+1} = r_{i} + \beta_i p_{i}, $$
$$ x_{i+1} = x_{i} + \alpha_i p_{i+1}. $$
One can check that $\mathcal{L}(x_{i},x_{i-1},r_{i})=\mathcal{L}(x_{i},p_{i},r_{i})$.
The stable version explains name of the method:
In standard CG method we would minimze Rayleigh quotient $R$ in the conjugate gradient direction $p_{i+1}$:
$$\alpha_i = \arg\min_{\alpha_i} R(x_i + \alpha_i p_{i+1}).$$
In the locally-optimal CG we minimize over two parameters:
$$\alpha_i, \beta_i = \arg\min_{\alpha_i,\beta_i} R\left(x_i + \alpha_i p_{i+1}\right) = \arg\min_{\alpha_i,\beta_i} R\left(x_i + \alpha_i (r_{i} + \beta_i p_{i})\right)$$
and we locally obtain more optimal solution. That is why the method is called locally optimal.
As for PINVIT coefficients $\alpha_i,\beta_i$ can be found by the Rayleigh-Ritz procedure.
In the block version similarly to PINVIT on each iteration we are given basis $V=[X^{(i)}_k,B^{-1}R^{(i)}_k, P^{(i)}_k]$ and use Rayleigh-Ritz procedure.
The overall algorithm:
Deflation technique which stops iterating converged eigestates can also be applied here.
The method also converges linearly, but faster than PINVIT.
Locally optimal preconditioned solver
Linear convergence
Preconditioner $(A-\sigma I)$ is not always good for eigenvalue problems
The next method (Jacobi-Davidson) has smart preconditioning and superlinear convergence (if systems are solved accurately)!
Jacobi-Davidson method is a very popular technique for solving eigvalue problems (not only symmetric!).
It consits of two key ingredients:
But we will derive it similarly to the original paper.
Jacobi not only presents the way to solve the eigenvalue problem by Jacobi rotations, but also proposed an iterative procedure. Let $x_j$ be the current approximation, and $t$ the correction:
$$A(x_j + t) = \lambda (x_j + t),$$
and we look for the correction $t \perp x_j$ (new orthogonal vector).
Then, the parallel part has the form
$$x_j x^*_j A (x_j + t) = \lambda x_j,$$
which simplifies to
$$R(x_j) + x^* _j A t = \lambda.$$
The orthogonal component is
$$( I - x_j x^*_j) A (x_j + t) = (I - x_j x^*_j) \lambda (x_j + t),$$
which is equivalent to
$$ (I - x_j x^*_j) (A - \lambda I) t = (I - x_j x^*_j) (- A x_j + \lambda x_j) = - (I - x_j x^*_j) A x_j = - (A - R(x_j) I) x_j = -r_j. $$
$r_j$ is the residual.
Since $(I - x_j x^*_j) t = t$, we can rewrite this equation in the symmetric form
$$ (I - x_j x^*_j) (A - \lambda I) (I - x_j x^*_j) t = -r_j.$$
Now we replace $\lambda$ by $R(x_j)$, and get the Jacobi correction equation:
$$ (I - x_j x^*_j) (A - R(x_j) I) (I - x_j x^*_j) t = -r_j. $$
Since $r_j \perp x_j$ this equation is consistent, if $(A - R(x_j) I)$ is non-singular.
Typically Jacobi equation is solved inexactly by the appropriate Krylov method.
Even inexact solution of Jacobi equation ensures (why?) that the correction $t$ is orthogonal to $x_j$, which is good for computations.
If this equation is solved exactly, we will get Rayleigh quotient iteration! Let us show that.
$$ (I - x_j x^*_j) (A - R(x_j) I) (I - x_j x^*_j) t = -r_j.$$
$$ (I - x_j x^*_j) (A - R(x_j) I) t = -r_j.$$
$$ (A - R(x_j) I) t - \alpha x_j = -r_j, \quad \alpha = x^*_j (A - R(x_j) I) x_j$$
$$ t = \alpha (A - R(x_j) I)^{-1}x_j - (A - R(x_j) I)^{-1}r_j,$$
Thus, since $(A - R(x_j) I)^{-1}r_j = (A - R(x_j) I)^{-1}(A - R(x_j) I)x_j = x_j$ we get
$$x_{j+1} = x_j + t = \alpha (A - R(x_j) I)^{-1}x_j,$$
which is Rayleigh quotient iteration up to normalization.
A popular preconditioner for solving Jacobi equation by Krylov method has the form
$$ \widetilde K = (I - x_j x^*_j) K (I - x_j x^*_j) $$
where $K$ is easily-inverted approximation of $(A - R(x_j) I)$.
We need to derive how to solve a system with $\widetilde K$ in terms of solving a system with $K$.
We already showed that equation
$$ (I - x_j x^*_j) K (I - x_j x^*_j) \tilde t = f $$
is equavelnt to
$$ \tilde t = \alpha K^{-1}x_j + K^{-1}f $$
The trick now is to forget about the value of $\alpha$ and find it from $\tilde t\perp x_j$ to maintain orthogonality:
$$ \alpha = \frac{x_j^*K^{-1}f}{x_j^* K^{-1}x_j} $$ Thus for each iteration of the Jacobi equation we calculate $K^{-1}x_j$ and then update only $K^{-1}f$ on each internal Krylov iteration
On each iteration of the method we expand a basis with new $t$.
Namely, $V_j = [v_1,\dots,v_{j-1},v_j]$, where $v_j$ is vector $t$ orthogonalized to $V_{j-1}$.
Then standard Rayleigh-Ritz procedure is used.
Historal fact: Initially subspace acceleration was used in the Davidson method.
However, instead of the Jacobi equation, equation $(\mathrm{diag}(A) - R(x_j)I)t = -r_j$ was used.
Davidson method was very popular in quantum chemistry computations.
If we want many eigenvectors, we just compute partial Schur decomposition:
$$A Q_k = Q_k T_k, $$
and then want to update $Q_k$ by one vector added to $Q_k$. We just use instead of $A$ the matrix $(I - Q_k Q^*_k) A (I - Q_k Q^*_k)$.
The correction equation can be solved only roughly, and JD method is often the fastest.
Fast Fourier transform
Structured matrices (Toeplitz, Circulants)